Feature selection in wind speed forecasting systems based on meta-heuristic optimization

Wind speed forecasting technology can improve the safety and stability of power networks with heavy wind penetration. Because wind is unpredictable and unstable, accurately forecasting wind power and speed is challenging. Several approaches based on processing time-series data have been developed to improve this accuracy. This work proposes a method for predicting wind speed with high accuracy based on a novel weighted ensemble model. The weight values in the proposed model are optimized using an adaptive dynamic grey wolf-dipper throated optimization (ADGWDTO) algorithm. The original grey wolf optimization (GWO) algorithm is redesigned to emulate dynamic group-based cooperation, which addresses the difficulty of balancing exploration and exploitation. The quick bowing movements and white breast that distinguish the hunting method of dipper throated birds are employed to improve the exploration capability of the proposed algorithm. The proposed ADGWDTO algorithm optimizes the hyperparameters of the multi-layer perceptron (MLP), K-nearest regressor (KNR), and long short-term memory (LSTM) regression models. A dataset from Kaggle entitled Global Energy Forecasting Competition 2012 is employed to assess the proposed algorithm. The findings confirm that the proposed ADGWDTO algorithm outperforms the state-of-the-art wind speed forecasting algorithms in the literature. The proposed binary ADGWDTO algorithm achieved an average fitness of 0.9209 with a standard deviation of fitness of 0.7432 for feature selection, and the proposed weighted optimized ensemble model (Ensemble using ADGWDTO) achieved a root mean square error of 0.0035 in comparison with state-of-the-art algorithms. The stability and robustness of the proposed algorithm are confirmed by statistical analysis using several tests, such as one-way analysis of variance (ANOVA) and Wilcoxon's rank-sum test.


Introduction
Wind energy can deliver a long-term energy supply and thus plays a crucial role in micro-grid and intelligent grid architectures as an essential low-carbon energy source. The increased utilization of wind power in power grids may substantially influence system reliability and quality, given that the amount of wind energy generated depends directly on the wind speed. A precise wind speed forecasting technology can improve the safety and stability of power systems [1].
Wind power generation, however, is inherently unpredictable and intermittent, which poses several challenges to broader adoption. With the aid of wind speed and power generation estimates, it is possible to reduce energy imbalance and to make production scheduling and dispatching decisions. Furthermore, forecasts can lower costs by anticipating the need for wind curtailment and can increase profits in power market operations. However, reliable forecasting of wind speed and power is exceedingly difficult due to the unstable and unpredictable nature of wind. A wind power prediction estimates the production of one or more wind turbines; a group of turbines is referred to as a wind farm. Forecasts may also be expressed in terms of energy by integrating power production over each period [2].
The fundamental objective of wind speed and power forecasting is to provide vital information regarding the expected wind power and speed for the following days, hours, or minutes. The four time frames, classified according to power system operation needs, are long-term (from seven days down to one day), medium-term (from twenty-four hours down to six hours), short-term (from six hours down to thirty minutes), and very short-term (from thirty minutes down to a few seconds). Turbine control and load tracking rely on very short-term forecasts. Short-term forecasting makes it possible to distribute preloads. Medium-term forecasts are used for both power system management and energy trading. Long-term forecasts are used to develop maintenance strategies for wind turbines [3].
Wind speed forecasting is a time-dependent and non-linear problem, which motivates researchers to utilize the information contained in previous wind data. Long short-term memory (LSTM) networks, which operate on time-series data, are one of the most popular methods for making predictions [4]. Wind power forecasting has also been addressed using statistical and numerical weather prediction models. For two locations in Brazil, the Brazilian developments on the regional atmospheric modeling system were used to simulate wind speed forecasts 72 hours in advance at ten-minute intervals [5].
Based on a dataset of two months of recordings with a fifteen-minute sampling interval, the authors in [6] forecast the wind power of sixteen wind farms in China using a back-propagation neural network (BPNN), a least squares support vector machine, and a radial basis function NN. The authors in [3] applied a deep learning feed-forward NN and isolation forest (IF) to predict wind power using SCADA data from a seven-megawatt wind turbine located in Scotland, based on a 12-month dataset with a one-second sampling rate.
The authors of [7] used restricted Boltzmann machines and rough set theory to create an interval probability distribution learning (IPDL) model for capturing the unsupervised temporal characteristics of wind speed data. The IPDL model collects interval-adjustable latent variables in order to capture the probability distribution of wind speed time-series data. A real-valued interval deep belief network (IDBN) for supervised regression of future wind speed data was developed using the IPDL model and a fuzzy type II inference system. A deep neural network (DNN) architecture with a stacked denoising auto-encoder (SDAE) and a stacked auto-encoder was created for wind speed forecasting by the authors of [8].
Wind speed time-series forecasts were generated using the temporal features retrieved from the network nodes. The authors of [9] presented graph convolutional deep learning (GCDL) as a scalable framework for learning strong spatio-temporal characteristics from nearby wind farms using wind direction and speed data. Their model combined rough set theory with the GCDL architecture. The authors of [10] provided a framework for improving the architecture and hyperparameters of the LSTM deep learning model for predicting wind speed based on an enhanced grasshopper optimization method.
To predict short-term wind speed, the researchers in [11] used wavelet transform variants and several variants of support vector regression (SVR). To find the optimal regressor for wind forecasting applications, they tested their suggested methodologies using a variety of performance metrics. Random forests, convolutional neural networks (CNN), the discrete wavelet transform (DWT), and twin SVR were utilized by the authors of [12] for wind forecasting. The wavelet transform was used to improve the information retrieved from wind speed. In addition, the authors of [13] developed an adaptive threshold and twin SVM (TWSVM) approach for detecting anomalies in wind turbine gearboxes. Some of the most recent methods for predicting wind power are shown in Table 1.
The authors of [14] proposed a novel framework for electrical power system forecasting based on the multi-objective dragonfly algorithm (MODA). In this study, the MODA method was used to optimize a modified Elman neural network (ENN) model. The tested dataset was collected at two observation locations in Penglai, China, over the course of 37 days, with a sampling interval of 10 minutes per site. In [15], the authors proposed a wind turbine power output forecast model based on an artificial neural network (ANN). Their data was collected at three additional sites along the northwest coast of Senegal over periods of 6 to 9 months, with a sampling interval between 1 and 10 minutes. Inspired by the localized first-order approximation of spectral graph convolutions, a scalable graph convolutional deep learning architecture (GCDLA) employs extracted temporal features to predict the wind-speed time series of the complete network nodes [9]. The simulation findings, based on six years of data from 145 wind stations in the northern states of the United States with a sampling interval of five minutes, demonstrate the benefits of capturing spatial and temporal interval information at a deep level.
Based on LSTM and the Gaussian mixture model (GMM), short-term forecasting and uncertainty analysis of wind turbine output were presented in [16] using a dataset of 123 wind farm units in north China covering three months with a 15-minute sampling interval. In [17], a hybrid deep learning-based neural network for 24-h wind power forecasting with a 60-minute sampling interval is presented, based on a Taiwanese wind farm. The authors of [18] proposed ensemble wind speed forecasting utilizing deep learning and an adaptive dynamic optimization algorithm, with a sampling interval of 60 minutes over 18 months. To improve wind speed forecasts, this research uses a novel optimization algorithm referred to as ADGWDTO, which is built on the grey wolf and dipper throated optimization techniques. Although Grey Wolf Optimization (GWO) [19] is easy to use and balances exploration and exploitation well, it has drawbacks, including a low exploration rate and a drop in performance when the search space contains many local optima. The performance of the Dipper Throated Optimization (DTO) [20] method deteriorates because it depends on a large number of variables during the optimization process, and the algorithm also tends to converge prematurely. However, DTO offers a significant advantage in its healthy balance between exploration and exploitation. The proposed approach uses the DTO algorithm to benefit from this advantage while accounting for its limitations.
Therefore, the purpose of this research is to provide a new ensemble model that uses a novel meta-heuristic optimization approach to forecast wind speed. The suggested ensemble model is made up of three machine learning regression models: the multi-layer perceptron (MLP), the k-nearest regressor (KNR), and the long short-term memory (LSTM). The proposed optimization method, referred to as ADGWDTO and based on the grey wolf and dipper throated optimization algorithms, improves the performance of the proposed ensemble model. To make accurate forecasts of wind speed, the ADGWDTO method is used to optimize the hyperparameters of the regression models as well as the weights of the ensemble model. To evaluate the effectiveness of the proposed methodology, a dataset taken from the Kaggle global energy forecasting competition [21] is used to predict the hourly power output up to 48 hours in advance.
The feature selection problem for the wind power forecasting dataset is solved using a new binary ADGWDTO algorithm. Compared with algorithms such as the Genetic Algorithm (GA) [22], Firefly Algorithm (FA) [23], Particle Swarm Optimization (PSO) [24], Whale Optimization Algorithm (WOA) [25][26][27], Grey Wolf Optimizer (GWO) [19], and Dipper Throated Optimization (DTO) [20], the proposed algorithm is confirmed to achieve the best performance. In addition, the proposed ensemble model is compared with three other ensemble models to demonstrate its superiority and efficacy. ANOVA and Wilcoxon's rank-sum tests are conducted to validate the accuracy of the proposed methods.
The primary contributions of this work are as follows:
• A new adaptive dynamic grey wolf-dipper throated optimization (ADGWDTO) technique is proposed.
• A binary ADGWDTO method, which is a binary variant of the suggested technique, is used to select features from the tested dataset.
• A weighted optimal ensemble model based on the proposed ADGWDTO technique is built to enhance the accuracy of the tested dataset's classification.
• The Wilcoxon rank-sum test and the ANOVA test are used to evaluate the statistical significance of the ADGWDTO algorithm.
• The ADGWDTO algorithm is used to improve the performance of classification methods so that they can be used in new applications.
• Both the binary ADGWDTO technique and the classification algorithm based on regression models can be generalized and evaluated on a wide variety of datasets.
The remainder of the paper is structured as follows: The methods and materials are reviewed in Section 2. The proposed methods are then detailed in Section 3. Section 4 presents and analyzes the experimental outcomes. In Section 5, the conclusions and future directions are presented.

Methods and materials
This section reviews the MLP, KNR, and LSTM base models. The ensemble model technique is also introduced to illustrate how it combines the base models. The adaptive grey wolf optimization and dipper throated optimization methods are then described.

Multi-layer perceptron (MLP)
Several types of artificial neural networks (ANNs) can be utilized for classification and prediction. ANNs can model data patterns or sets of cause-and-effect variables through transient detection, approximation, time-series forecasting, and pattern recognition approaches. In ANNs, a group of neurons are densely connected and operate together to solve regression and classification problems in various fields [28]. MLP is a type of ANN in which neurons are organized in layers referred to as input, hidden, and output layers. The weighted sum of the inputs to a neuron is computed as follows [29]:
$$S_j = \sum_{i} w_{ij} I_i + \beta_j \tag{1}$$
where $I_i$ is an input, $w_{ij}$ is the connection weight between input $I_i$ and neuron $j$, and $\beta_j$ is the bias value. The output of neuron $j$ can be calculated as follows:
$$f_j(S_j) = \frac{1}{1 + e^{-S_j}} \tag{2}$$
where a sigmoid function is used, and the $f_j(S_j)$ values can be used to get the output of the network as
$$y_k = \sum_{j} w_{jk} f_j(S_j) + \beta_k \tag{3}$$
where $w_{jk}$ represents the weight between hidden layer neuron $j$ and output node $k$, and $\beta_k$ refers to the output layer bias value.
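The forward pass described above can be illustrated with a minimal sketch, assuming a single hidden layer with sigmoid activations and a linear output layer. The function and argument names (`mlp_forward`, `w_hidden`, `w_out`, and so on) are illustrative and are not taken from the original work.

```python
import math


def sigmoid(s):
    """Logistic activation f(S) = 1 / (1 + exp(-S))."""
    return 1.0 / (1.0 + math.exp(-s))


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP (illustrative sketch).

    w_hidden[i][j]: weight between input i and hidden neuron j
    w_out[j][k]:    weight between hidden neuron j and output node k
    """
    hidden = []
    for j in range(len(b_hidden)):
        # Weighted sum of the inputs plus the neuron bias
        s_j = sum(inputs[i] * w_hidden[i][j] for i in range(len(inputs))) + b_hidden[j]
        hidden.append(sigmoid(s_j))
    outputs = []
    for k in range(len(b_out)):
        # Output node: weighted sum of hidden activations plus output bias
        y_k = sum(hidden[j] * w_out[j][k] for j in range(len(hidden))) + b_out[k]
        outputs.append(y_k)
    return outputs
```

With all-zero inputs and zero hidden biases, every hidden activation equals 0.5, which gives a quick sanity check of the implementation.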

K-Nearest Neighbor Regressor (KNR)
KNR makes predictions from the historical occurrences that are most similar to the current state, as determined by a distance measure. Predictions are generated as a weighted average over the K nearest neighbors. KNR uses the Euclidean distance as the metric between the $X_{train}$ and $X_{test}$ sets:
$$d(X_{train}, X_{test}) = \sqrt{\sum_{i} (X_{train,i} - X_{test,i})^2}$$
The prediction for the test data is the weighted combination of the neighbors' values, where $w_j$ refers to the weight of the $j$-th neighbor. The value of this weight is adjusted using the observed data. For the number of training data denoted by $n$, the value of $w_j$ is measured as $w_j = j/n$.
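A minimal sketch of a weighted K-nearest regressor is given below. It assumes Euclidean distance and a normalized weighted average of the neighbors' targets; the default of uniform weights and the names `knr_predict` and `weights` are assumptions made for illustration, not the weighting scheme of the original work.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knr_predict(x_train, y_train, x_query, k=3, weights=None):
    """Predict a target as the weighted average of the k nearest neighbors.

    weights: optional list of k weights, nearest neighbor first.
             Uniform weights are used when omitted (an assumption).
    """
    # Rank training samples by distance to the query point
    order = sorted(range(len(x_train)), key=lambda i: euclidean(x_train[i], x_query))
    nearest = order[:k]
    if weights is None:
        weights = [1.0] * len(nearest)
    used = weights[:len(nearest)]
    # Normalize so the prediction stays on the scale of the targets
    return sum(w * y_train[i] for w, i in zip(used, nearest)) / sum(used)
```

For example, `knr_predict([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0], [0.4], k=2)` averages the targets of the two closest samples.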

Long short term memory (LSTM)
According to [30], LSTM is an improved ANN model that can be applied to a variety of problems. The key benefit of the LSTM is its ability to retain information over an extended period. Fig 1 depicts the LSTM design. In the first phase of the LSTM model, a decision is made on which cell state data to discard. Eq 6 describes the usage of a sigmoid layer for this purpose.
In the next stage, the cell state is updated with new input data. New candidates are selected by the sigmoid layer and added to the produced state, as shown in Eqs 7 and 8. The previous cell state, denoted by $C_{t-1}$, is then updated to a new state, $C_t$, in Eq 9 based on Eqs 6-8.
The final stage determines the output. The sigmoid layer decides which portions of the cell state should be output. The cell state is passed through tanh to force its values into [−1, 1], and the result is multiplied by the sigmoid gate output.
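The three stages described above correspond to the standard LSTM cell update. The sketch below assumes the textbook formulation with a single scalar unit, since the exact parameterization of Eqs 6-10 is not reproduced here; the gate names and the parameter layout `p` are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (standard formulation, assumed).

    p maps a gate name to a (w_h, w_x, b) tuple of scalar parameters.
    Returns the new hidden state h_t and cell state c_t.
    """
    def gate(name, activation):
        w_h, w_x, b = p[name]
        return activation(w_h * h_prev + w_x * x_t + b)

    f_t = gate("forget", sigmoid)            # which cell state data to discard
    i_t = gate("input", sigmoid)             # which candidate values to admit
    c_tilde = gate("candidate", math.tanh)   # new candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # updated cell state
    o_t = gate("output", sigmoid)            # which parts of the state to output
    h_t = o_t * math.tanh(c_t)               # tanh keeps the state in [-1, 1]
    return h_t, c_t
```

With all parameters set to zero, each sigmoid gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at every step.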

Ensemble models
The basic objective of ensemble models is to combine the capabilities of multiple individual base models into a unified model with enhanced performance. Several methods can be followed to realize an ensemble model. Resampling the training set, for example, is an effective strategy, while other techniques use different prediction methods or modify specific parameters of a predictive model [31]. This article proposes a weighted ensemble model composed of three machine learning models: MLP, KNR, and LSTM. The weights of the ensemble model are optimized using a new optimization approach discussed in the following sections. For comparison, other ensemble models, such as the average and SVR ensembles, are used in the experiments to show the effectiveness of the proposed ensemble.
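A weighted ensemble of this kind can be sketched as a weighted average of the base model predictions, scored by the root mean square error (RMSE). The sketch assumes that the weights are normalized to sum to one before combining, which is a common convention rather than a detail stated in the text; `weighted_ensemble` and `rmse` are illustrative names.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions with normalized weights.

    predictions: list of per-model prediction lists (same length each),
                 e.g. [mlp_pred, knr_pred, lstm_pred]
    weights:     one non-negative weight per model
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]  # normalization is an assumption
    n_samples = len(predictions[0])
    return [
        sum(norm[m] * predictions[m][s] for m in range(len(predictions)))
        for s in range(n_samples)
    ]


def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5
```

An optimizer such as the one proposed here would search over the weight vector to minimize `rmse` on validation data; the average ensemble corresponds to equal weights.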

Adaptive grey wolf optimization
Despite its widespread usage in optimization, the original GWO has been shown to have several shortcomings and limitations. These include early convergence, limited precision, and an inability to locate the ideal solution. The search can easily become trapped in local optima when the leader wolves, alpha, beta, and delta, all converge to the same solution, because the three leaders constantly change their positions in response to each other. The capacity of GWO to organize and handle a complicated search space is limited. A further problem is that GWO cannot properly balance exploration and exploitation, since exploration is carried out first and exploitation second. Consequently, escaping from local optima in the final GWO iterations is difficult, and the search for an optimal solution may fail. In addition, the performance of the GWO algorithm is strongly affected by the number of variables, which is attributable to the initial population of a local solution.
The original GWO algorithm is redesigned in this work to emulate dynamic group-based cooperation, which addresses the difficulty of balancing exploration and exploitation. There are three leading solutions in grey wolf optimization: alpha ($S_\alpha$), which is the best solution, followed by beta ($S_\beta$) and delta ($S_\delta$). The other solutions retrieved by the algorithm are denoted by $S_\gamma$. The grey wolf optimization is formulated as follows.
$$S(t+1) = S_p(t) - A \cdot |C \cdot S_p(t) - S(t)| \tag{11}$$
where $S$ represents the agent position and $t$ is the current iteration. $S_p(t)$ is the position of the best agent (prey), and $A$ and $C$ are defined as follows.
$$A = 2a \cdot r_1 - a \tag{12}$$
$$C = 2 r_2 \tag{13}$$
where $r_1$ and $r_2$ are randomly selected values in [0, 1], and $a$ takes values in [0, 2] and decreases linearly. To control the balance between exploitation and exploration, the value of $a$ is updated as follows based on the available iterations $M_t$.
$$a = 2 - t \cdot \frac{2}{M_t}$$
The process of agent position updating is described using the following equations based on the three fittest solutions, $S_\alpha$, $S_\beta$, and $S_\delta$.
$$S_1 = S_\alpha - A_1 \cdot |C_1 \cdot S_\alpha - S|, \quad S_2 = S_\beta - A_2 \cdot |C_2 \cdot S_\beta - S|, \quad S_3 = S_\delta - A_3 \cdot |C_3 \cdot S_\delta - S|$$
where $A_1$, $A_2$, and $A_3$ are calculated by Eq (12), and $C_1$, $C_2$, and $C_3$ are calculated by Eq (13). The new position of population agents is determined by the following equation.
$$S(t+1) = \frac{S_1 + S_2 + S_3}{3}$$
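The position update guided by the three fittest solutions can be sketched as follows, assuming the standard GWO formulation in which each leader proposes a candidate position and the agent moves to their average. The function names and the per-dimension sampling of the random values are assumptions made for illustration.

```python
import random


def linear_a(t, max_iter):
    """Control parameter a, decreasing linearly from 2 to 0 over the iterations."""
    return 2.0 - 2.0 * t / max_iter


def gwo_step(position, alpha, beta, delta, a, rng=random):
    """Move one agent toward the alpha, beta, and delta leaders (standard GWO)."""
    new_position = []
    for d in range(len(position)):
        candidates = []
        for leader in (alpha, beta, delta):
            A = 2.0 * a * rng.random() - a      # A = 2a * r1 - a
            C = 2.0 * rng.random()              # C = 2 * r2
            D = abs(C * leader[d] - position[d])
            candidates.append(leader[d] - A * D)
        # New coordinate is the mean of the three leader-guided candidates
        new_position.append(sum(candidates) / 3.0)
    return new_position
```

When `a` reaches zero at the final iteration, `A` vanishes and the agent lands exactly on the mean of the three leaders, which illustrates how the search contracts toward exploitation.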
Finding the global minimum is a complex undertaking. GWO accomplishes this task through two processes: exploration and exploitation. Exploration is the process of discovering promising regions of the search space, whereas exploitation is the process of finding better solutions close to previously successful ones. Optimization algorithms benefit from exploration because it keeps them from becoming stuck in local optima. Exploration of the search space is encouraged in the early iterations of an optimization algorithm. In subsequent iterations, agents use the knowledge gained to find the global minimum. The population of the adaptive GWO is divided into two groups of agents: the exploration group $n_1$ and the exploitation group $n_2$. Algorithm 1 presents the adaptive GWO algorithm in detail.
1: Initialize population S_i (i = 1, 2, ..., n) with size n, fitness function F_n, and iterations M_t
2: Initialize parameters a, A, C, and t = 1
3: Calculate F_n for each agent S_i
4: Get best, second best, and third best agents as S_α, S_β, S_δ
5: while t < M_t do
6:    Update exploration group (n_1) and exploitation group (n_2) for n = n_1 + n_2
7:    if (Best F_n is the same for the last three iterations) then
8:       Increase exploration group agents (n_1)
9:       Decrease exploitation group agents (n_2)
10:   end if
11:   for (i = 1 : i ≤ n_1) do
12:      Calculate

Dipper throated optimization
Birds of the Cinclidae family, such as the dipper throated bird, are known for their bobbing or dipping motions while perched. What distinguishes the bird from other passerines is its ability to dive, swim, and hunt below the water's surface. It plunges headlong into turbulent or fast-flowing water to catch its prey. It turns over pebbles and stones to find the small fish and invertebrates that live in the water. The bird moves along the bottom, can dive deep and stay immersed for a long time, and uses its wings to propel itself through the water. In the Dipper-Throated Optimization (DTO) approach, a flock of birds is assumed to swim in search of food [20].
The white breast and quick bowing movements that distinguish the hunting method of dipper throated birds are employed in this work to improve the exploration capability of the proposed algorithm. The following matrices represent the locations and velocities of the birds.

$$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,d} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n,1} & A_{n,2} & \cdots & A_{n,d} \end{bmatrix}$$
where $A_{i,j}$ represents the position of the $i$-th bird in the $j$-th dimension, for $n$ birds and $d$ dimensions. The velocities $B_{i,j}$ are arranged in a matrix $B$ of the same shape.
The fitness score of each bird reflects the agent's quest for food, and the superior value indicates the mother bird. In the DTO algorithm, the positions and velocities of the agents are updated as follows, where $A_{best}$ represents the best solution and $A(t)$ denotes the other (follower) birds.
$$A(t+1) = A_{best}(t) - K_1 \cdot |K_2 \cdot A_{best}(t) - A(t)| \tag{21}$$
$$A(t+1) = A(t) + B(t+1) \tag{22}$$
where $B(t+1)$ is calculated as
$$B(t+1) = K_3 B(t) + K_4 r_1 (A_{best}(t) - A(t)) + K_5 r_2 (A_{Gbest} - A(t)) \tag{23}$$
where $A_{Gbest}$ indicates the global best solution, $t$ is the iteration number, and $B(t+1)$ represents the agent's velocity at iteration $t+1$. $K_1$, $K_2$, and $K_3$ are variable weight values, while $K_4$ and $K_5$ are constants. $r_1$ and $r_2$ are selected randomly in [0, 1]. The parameters of the classification neural network will be improved using the continuous DTO algorithm, while a binary version of the DTO algorithm is used to select features. The DTO algorithm is explained in Algorithm 2 [20].
Algorithm 2: DTO Algorithm.
7:       if (R < 0.5) then
8:          Update agent position as in Eq 21
9:       else
10:         Update agent velocity as in Eq 23
11:         Update agent position as in Eq 22
12:      end if
13:   end for
14:   Get h for all agents A_i
15:   Update K_1, K_2, R, t = t + 1
16:   Find best agent A_best
17:   Set A_Gbest = A_best
18: end while
19: Return A_Gbest
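The branch in Algorithm 2 between the position update (Eq 21) and the velocity-based update (Eqs 22 and 23) can be sketched as below. This is an illustrative reading rather than the reference implementation: the term weighted by $K_5$ toward the global best is an assumption inferred from the symbols defined in the text, and the dictionary `k` and the injectable `rng` are hypothetical conveniences.

```python
import random


def dto_step(a, b, a_best, a_gbest, k, rng=random):
    """One DTO update for a single agent (illustrative sketch).

    a, b:     current position and velocity (lists of floats)
    a_best:   best solution of the current iteration
    a_gbest:  global best solution
    k:        dict holding the weights K1..K5
    """
    dims = range(len(a))
    if rng.random() < 0.5:
        # Position update relative to the best solution (Eq 21)
        new_a = [a_best[d] - k["K1"] * abs(k["K2"] * a_best[d] - a[d]) for d in dims]
        return new_a, list(b)
    r1, r2 = rng.random(), rng.random()
    # Velocity update (Eq 23); the K5 term toward a_gbest is an assumption
    new_b = [
        k["K3"] * b[d]
        + k["K4"] * r1 * (a_best[d] - a[d])
        + k["K5"] * r2 * (a_gbest[d] - a[d])
        for d in dims
    ]
    # Position update using the new velocity (Eq 22)
    new_a = [a[d] + new_b[d] for d in dims]
    return new_a, new_b
```

Passing a seeded `random.Random` instance as `rng` makes individual runs reproducible, which is useful when comparing optimizers over repeated runs.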

The proposed methodology
The machine learning models and the proposed ensemble model are optimized using the optimization algorithm proposed in this section. The suggested optimization approach is based on the adaptive dynamic grey wolf dipper throated optimization (ADGWDTO) algorithm, which divides the population into two groups, as explained in the following sections. The steps of the proposed optimization algorithm are detailed in Algorithm 3.

Exploration group
This group is in charge of the exploration process, which aims to locate promising regions within the search space. It is also responsible for ensuring that the ADGWDTO does not get stuck in a local optimum. To achieve this, the group implements two different tactics.
3.1.1 Mutation. Mutation is used to ensure the diversity of the population, which permits the ADGWDTO optimizer to search in various regions of the search space.
3.1.2 Explore around the solution. The candidate searches the promising regions surrounding its position in the search space by utilizing the following equations to find the optimal fitness.
$$P(t+1) = P_{best}(t) - K_1 \cdot |K_2 \cdot P_{best}(t) - P(t)| \quad \text{if } R < 0.5$$
Algorithm 3: The proposed ADGWDTO algorithm.
8:    In each group, update the number of solutions
9:    if best fitness did not improve for 3 iterations then
10:      Increase the number of solutions in the exploration group
11:      Mutate the solution by K = 1 − 2k × X t^2 / (Solutions-count)^2

Exploitation group
This group is responsible for exploitation, which is the act of locating better spots near existing good solutions; to accomplish this, ADGWDTO employs two strategies.

Moving towards the best solution.
Using the following equation, the individual works toward the optimal solution:
3.2.2 Search around the leader. The individuals search around the leader because this increases the probability of obtaining a better solution. The ADGWDTO does this by using the following equations:
$$S(t+1) = S(t) + D \cdot (2 r_5 - 1) \tag{28}$$
The velocity of the agent, $V(t+1)$, is calculated at iteration $t+1$, where $P_{best}(t)$ is the best bird position. $K_1$, $K_2$, and $K_3$ are variable weights, while $K_4$ and $K_5$ are constants. $r_1$ and $r_2$ are randomly selected in [0, 1].

Adaptive dynamic approach
Fitness values are calculated for each solution in the population upon initialization of the optimization process, and the optimization algorithm selects the best agent accordingly. The algorithm begins the adaptive dynamic process by dividing the population of agents into the exploration group and the exploitation group. The primary goal of the exploration group is to locate the leaders, while the primary goal of the exploitation group is to find the optimum solution. There is a constant exchange of information between the agents of the two groups. The algorithm starts with half of the population in the exploration group and the other half in the exploitation group. The number of agents in each of the two groups should be balanced and dynamically changed over multiple iterations to reach the optimum solution.

Responsive exploration
The ADGWDTO starts by initializing the population with a variety of different solutions and determines the best solution using the fitness function. It then divides the population into two groups: group A for exploration and group B for exploitation. Initially, the ADGWDTO assigns 70% of the population to group A, which is responsible for exploration, and 30% to group B, which is responsible for exploitation. Group A takes the larger share at the beginning in order to explore as much of the search space as possible. This percentage, however, changes dynamically during the iterations. In each iteration, the ADGWDTO compares the convergence and best solution of the current iteration with those of the two preceding iterations. If the best solution has remained unchanged for three consecutive iterations, the number of solutions in group A is increased to facilitate exploration and help avoid local optima. This makes the ADGWDTO more responsive to changes during the iterations and balances exploring the search space against refining solutions around the best one. As a result, the algorithm avoids being trapped in local optima and is more likely to locate the optimal solution.
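The stagnation check and group resizing described above can be sketched as follows. The step size of one agent per adjustment, the minimum size of the exploitation group, and the names `rebalance`, `patience`, and `tol` are assumptions made for illustration; the text does not specify by how much the exploration group grows.

```python
def rebalance(n_total, n_explore, best_history, step=1, patience=3,
              min_exploit=1, tol=0.0):
    """Return (n_explore, n_exploit) after checking for stagnation.

    best_history: best fitness value recorded at each iteration so far.
    The exploration group grows by `step` agents when the best fitness has
    not changed over the last `patience` iterations.
    """
    recent = best_history[-patience:]
    stagnated = len(recent) == patience and max(recent) - min(recent) <= tol
    if stagnated:
        # Grow exploration, but always keep some agents for exploitation
        n_explore = min(n_explore + step, n_total - min_exploit)
    return n_explore, n_total - n_explore
```

For a population of ten agents split 70/30, three identical best-fitness values in a row shift one agent from the exploitation group to the exploration group.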

Elitism
To guarantee the convergence quality throughout the iterations, elitism is added to the proposed ADGWDTO. Elitism carries the best agent of the current generation over to the next generation unaltered. This guarantees that the solution quality obtained by the ADGWDTO will not decrease from one generation to the next.
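One simple way to realize elitism is sketched below, assuming that fitness is minimized and that the preserved agent replaces the worst member of the new generation. Both choices, as well as the name `apply_elitism`, are assumptions made for illustration.

```python
def apply_elitism(old_pop, old_fit, new_pop, new_fit):
    """Carry the previous best agent into the new generation if it was lost.

    Fitness is assumed to be minimized. If no agent in the new generation is
    at least as good as the previous best, the previous best replaces the
    worst new agent.
    """
    best_old = min(range(len(old_pop)), key=lambda i: old_fit[i])
    if min(new_fit) <= old_fit[best_old]:
        return list(new_pop), list(new_fit)
    worst_new = max(range(len(new_pop)), key=lambda i: new_fit[i])
    pop, fit = list(new_pop), list(new_fit)
    pop[worst_new] = old_pop[best_old]
    fit[worst_new] = old_fit[best_old]
    return pop, fit
```

With this rule the best fitness in the population never gets worse between generations, which matches the guarantee stated above.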

Exploration-exploitation balance
The ADGWDTO needs to strike a healthy balance between exploitation and exploration, and one way to do this is by regularly altering the sizes of the two groups. The algorithm starts by placing half of the population in the exploration group and the other half in the exploitation group. It then makes adjustments based on the results of the two groups. In the early rounds of the optimization process, it is helpful to have a significant proportion of agents in the exploration group, which makes it easier to investigate the promising areas of the search space. Over time, the number of agents in the exploitation group increases while the number of agents in the exploration group decreases dynamically, which allows more agents to improve their fitness. In addition, the algorithm uses elitism to keep the leader in consecutive populations, which assures convergence when a better solution cannot be found in the new populations. If the leader's fitness does not improve significantly for three consecutive iterations, the search may suffer from local optima and stagnation. In this case, the ADGWDTO may increase the number of agents in the exploration group.

Binary optimizer
The output solution of the proposed ADGWDTO must be converted to binary values in {0, 1} for feature selection. The most common way to perform this conversion is the sigmoid function (Eq 31), which maps an optimizer's continuous solution to a binary solution, where S refers to the best position and t is the iteration number. The phases of the proposed binary ADGWDTO algorithm are displayed in Algorithm 4.

Algorithm 4: The proposed binary bADGWDTO algorithm.
4:    Find best agent (S)
5:    Get the binary value of S using Eq (31)
6:    Calculate fitness
7:    Update positions and velocities of the best agents, t = t + 1
8: end while
9: Return best solution
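The binary conversion step of Algorithm 4 can be sketched as a sigmoid followed by a threshold. The plain logistic function and the threshold of 0.5 used here are assumptions; the exact scaling of the sigmoid in Eq (31) is not reproduced in this sketch, and `to_binary` is an illustrative name.

```python
import math


def to_binary(solution, threshold=0.5):
    """Map a continuous solution vector to a 0/1 feature mask.

    Each component is squashed by the logistic sigmoid and set to 1 when the
    result reaches the threshold (assumed to be 0.5), otherwise 0.
    """
    mask = []
    for s in solution:
        p = 1.0 / (1.0 + math.exp(-s))
        mask.append(1 if p >= threshold else 0)
    return mask
```

A component equal to 1 in the resulting mask means that the corresponding feature is selected.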

Fitness function
The performance of the proposed optimization algorithm is evaluated using a fitness function. The fitness function depends on the selected features and the prediction error rate. A feature subset with a lower error rate and fewer selected features indicates a more successful feature selection. In the suggested feature selection approach, the following fitness function is utilized.

$$F_n = w_1 \times Err + w_2 \times \frac{\text{Num of selected features}}{\text{Total num of features}} \tag{32}$$
where $w_1 \in [0, 1]$ and $w_2 = 1 - w_1$ manage the relative significance of the classification error rate, $Err$, and the number of selected features for a population of size $n$.
The method can be considered adequate if it yields a subset of features that produces a low classification error rate. The k-nearest neighbor (k-NN) technique is a simple and commonly used classification method. Using the k-NN classifier in this method ensures that the selected features are of high quality. Classification relies only on the shortest distances between the query instance and the training examples; k-NN builds no explicit model in this experiment.
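The feature selection fitness can be sketched as a weighted sum of the error rate and the fraction of selected features. The sketch assumes that $w_1$ weights the error term and $w_2 = 1 - w_1$ weights the feature ratio, and that the error rate is computed elsewhere (for instance by a k-NN classifier on the selected features); the function name `fitness` is illustrative.

```python
def fitness(error_rate, mask, w1):
    """Feature selection fitness (lower is better).

    error_rate: classification error obtained with the selected features
    mask:       0/1 list marking the selected features
    w1:         weight of the error term in [0, 1]; w2 = 1 - w1 weights the
                fraction of selected features (assignment is an assumption)
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * error_rate + w2 * (sum(mask) / len(mask))
```

For example, an error rate of 0.2 with two of four features selected and `w1 = 0.9` gives a fitness of about 0.23.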

Complexity analysis
For a population of $n$ agents and $Max_{iter}$ iterations, the complexity analysis of the ADGWDTO algorithm is expressed as follows.
• Calculation of fitness function F n for each agent X i : O(n).
• Find the best agent: O(n).

Experimental results
This section presents and discusses the experimental conditions and findings based on the proposed ADGWDTO algorithm for wind speed prediction. The section then covers the outcomes of three scenarios: feature selection, ensemble model, and comparison to rival methods. The proposed ADGWDTO algorithm is evaluated using benchmark functions F1 through F7 [32]. Appendix A displays the mean and standard deviation, convergence curves, ANOVA and T-test results for the benchmark functions.

Dataset
The tests are based on a dataset for wind power forecasting to anticipate the future hourly power generation at seven wind farms for up to 48 hours. The dataset used is titled "Global Energy Forecasting Competition 2012-Wind Forecasting" and is available on Kaggle [21]. This dataset contains seven features for seven wind farms, including wind speed and wind direction. The correlation between the features of the dataset is shown in Fig 2.

Dataset preprocessing.
Because the recordings of the wind features might contain missing values, it is crucial to preprocess the dataset before training the machine learning models. Each missing value is replaced by the average of the previous and next non-missing values. In addition, scaling and normalizing the dataset values is essential to guarantee that the machine learning model treats all features similarly. This article employs the min-max scaler, which scales and bounds the data features between 0 and 1. The following equation expresses the min-max scaler used in this article.

$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$

where $x$ is a feature value, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of that feature.
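The two preprocessing steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions: missing values are represented as `None`, a gap at the start or end of a series is filled with the single available neighbor (the text does not say how such edges are handled), and a constant feature is mapped to zeros. The function names and the sample series are hypothetical.

```python
def fill_missing(series):
    """Replace None entries with the mean of the previous and next
    non-missing values of the original series."""
    filled = list(series)
    n = len(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        prev_val = next(
            (series[j] for j in range(i - 1, -1, -1) if series[j] is not None), None
        )
        next_val = next(
            (series[j] for j in range(i + 1, n) if series[j] is not None), None
        )
        if prev_val is not None and next_val is not None:
            filled[i] = (prev_val + next_val) / 2.0
        elif prev_val is not None:
            filled[i] = prev_val  # assumption: trailing gap copies the last value
        elif next_val is not None:
            filled[i] = next_val  # assumption: leading gap copies the first value
    return filled


def min_max_scale(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant feature maps to 0
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly wind speed recordings with gaps.
wind_speed = [4.0, None, 6.0, None, None, 10.0]
clean = fill_missing(wind_speed)
scaled = min_max_scale(clean)
print(clean)
print(scaled)
```

Filling is done before scaling so that the minimum and maximum are computed on a complete series.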

Evaluation criteria
The achieved results are assessed in terms of the criteria presented in Tables 2 and 3. The criteria listed in Table 2 are used to evaluate the performance of the proposed feature selection method, whereas the criteria listed in Table 3 are used to assess the prediction results achieved by the proposed algorithm. In these tables, M denotes the number of runs of the proposed and other competing optimizers, $S^*_j$ represents the best agent at run number j, and size($S^*_j$) indicates the best solution vector size. N is the number of test set points. Predicted and actual values are represented by $\hat{V}_n$ and $V_n$, respectively.

Table 2. Evaluation metrics used in assessing the proposed feature selection method.

Table 3. The evaluation metrics used in assessing the prediction results based on the proposed optimization algorithm.

Metric Value
RMSE $\sqrt{\frac{1}{N}\sum_{n=1}^{N}(\hat{V}_n - V_n)^2}$

Feature selection results
This experiment aims to demonstrate the efficacy and efficiency of the proposed binary bADGWDTO algorithm for feature selection. The evaluation metrics listed in Table 2 are used to compare the outcomes attained by the proposed algorithm with those attained by competing methods, namely bGWO [19], bPSO [33], bGWOPSO [24], bGA [22], bGWOGA [34], binary bat algorithm (bBA) [35], bWOA [25], binary biogeography optimization (bBBO) [36], binary Multiverse Optimization (bMVO) [37], binary Satin Bowerbird Optimizer (bSBO) [38], and binary Firefly Algorithm (bFA) [39]. The configuration parameters of the proposed algorithm are listed in Table 4, and the configuration parameters of the other algorithms are presented in Table 5. The evaluation of the results achieved by the suggested optimization approach and the competing methods is presented in Table 6. As shown in this table, the proposed bADGWDTO algorithm achieves the best results compared to the other approaches.

Ensemble prediction results
The features selected by the proposed bADGWDTO are used to train a new ensemble model composed of three regression models: MLP, KNR, and LSTM. The contribution of each regression model's prediction to the final wind speed value is weighted, and the weights are optimized using the proposed ADGWDTO. The weighted predictions are then averaged to generate the final result. Table 7 presents the evaluation results of the proposed ensemble using ADGWDTO in comparison to the base regression models and two other ensemble models, namely the average ensemble and the ensemble using support vector regression (SVR). The table shows that the proposed optimized ensemble achieves the best results on all the evaluation criteria presented earlier, which supports the robustness of the proposed approach in predicting the wind speed.
To demonstrate the efficacy of the suggested optimization algorithm, the proposed ensemble model is also optimized using GA, FA, PSO, WOA, GWO, and DTO. The results of the optimized ensemble models are shown in Table 8, with the outcomes of the proposed method in the first column. The proposed method performs better than the ensembles optimized by the other optimizers, indicating that it is better suited for determining the optimal ensemble model parameters.
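A weighted average ensemble of this kind, together with the RMSE used to score it, can be sketched as below. This is a simplified illustration only: the weights and the prediction values are made up for the example, the normalization by the sum of the weights is an assumption, and in the paper the weights are found by the ADGWDTO optimizer rather than set by hand. The names `weighted_ensemble` and `rmse` are hypothetical.

```python
import math


def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one weighted average per point.

    predictions: one list of predicted values per base model.
    weights: one non-negative weight per base model.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_points = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_points)
    ]


def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical normalized wind speeds and base-model predictions.
actual = [0.30, 0.50, 0.40]
mlp_pred = [0.32, 0.47, 0.43]
knr_pred = [0.27, 0.52, 0.38]
lstm_pred = [0.31, 0.50, 0.41]

# Hypothetical weights; an optimizer would search for the weight vector
# that minimizes rmse(weighted_ensemble(...), actual) on validation data.
weights = [0.2, 0.3, 0.5]
combined = weighted_ensemble([mlp_pred, knr_pred, lstm_pred], weights)
print(rmse(combined, actual))
```

Setting all weights equal reduces this to the plain average ensemble that serves as one of the comparison baselines.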

Statistical analysis
To assess the stability and significance of the proposed algorithm, two statistical tests were performed, namely the one-way analysis of variance (ANOVA) test and the Wilcoxon rank-sum test. In the ANOVA test, the null hypothesis H0 states that the mean values, μ, are equal: μADGWDTO = μGA = μFA = μPSO = μWOA = μGWO = μDTO. Table 9 displays the ANOVA test's measured values. Using the Wilcoxon rank-sum test, the results of the proposed ADGWDTO algorithm are compared to those of the alternative optimization algorithms. Between the suggested approach and the other algorithms, the p-values are less than 0.05. These results demonstrate the statistical significance of the suggested optimization procedure.
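For readers unfamiliar with the rank-sum test, the sketch below shows one common way to compute it using the normal approximation, with no tie or continuity correction. It is not the authors' code, the two samples are invented values standing in for per-run RMSE results, and the helper names are hypothetical. For small samples an exact test may be more appropriate.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns the z statistic and the two-sided p-value.
    """
    n1, n2 = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical RMSE values over ten runs of two optimizers.
proposed = [0.0031, 0.0035, 0.0033, 0.0036, 0.0034, 0.0032, 0.0037, 0.0030, 0.0038, 0.0029]
baseline = [0.0052, 0.0049, 0.0055, 0.0051, 0.0058, 0.0050, 0.0054, 0.0057, 0.0053, 0.0056]

z, p = rank_sum_test(proposed, baseline)
print(z, p)
```

A p-value below 0.05 would lead to rejecting the hypothesis that both sets of results come from the same distribution.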

Visual results
Fig 3 maps the predicted wind speed values against the actual values for the proposed weighted optimized ensemble model and the three base regression models. The figure shows that the results of the proposed approach lie close to the fit line, whereas the points of the other models are more scattered around their fit lines, which reduces their accuracy. Therefore, the results of the proposed approach can be considered more accurate than those of the other methods. Figs 4-6 present the residual, homoscedasticity, QQ, ROC, heatmap, RMSE, and RMSE histogram plots. The residual error lies within the range of -0.02 to +0.02 and the homoscedasticity values lie within the range of -0.001 to +0.003, demonstrating the robustness of the suggested method. In addition, the QQ plot shows that the predicted results match the actual values, further supporting the robustness of the suggested method. The ROC curves illustrate the maximum area under the curve attained by the suggested approach versus DTO and GWO. In addition, the heatmap and RMSE graphs show that the proposed optimization approach is superior.
Moreover, the RMSE histogram shows the number of occurrences of the RMSE values achieved by the proposed optimization algorithm and the other optimization methods. It can be noted from this figure that the proposed approach achieves the smallest RMSE values with the highest number of occurrences. These plots reinforce the findings previously discussed and show the effectiveness and superiority of the proposed method.

Conclusion
This paper presents a new meta-heuristic optimization-based method for improving the parameters of a weighted average ensemble model for forecasting wind speed in wind farms. By combining the grey wolf optimizer and dipper throated optimization algorithms, the suggested algorithm achieves a better balance between the exploitation and exploration groups of the optimization process. As a case study to demonstrate the efficacy of the proposed algorithm, the Kaggle dataset for wind power forecasting is used to estimate the hourly wind speed for the following 48 hours. In addition, a novel binary ADGWDTO algorithm is proposed to choose the significant features for improving the accuracy of prediction. The first series of experiments compares the performance of the suggested algorithms with that of other feature selection techniques. The second series of experiments compares the performance of the optimization algorithm against that of various regression and ensemble models, including two further ensembles: the average and the support vector regression-based ensemble models. In addition, statistical analysis employing ANOVA and Wilcoxon's rank-sum tests is conducted to confirm the significance of the proposed method. The experimental results, based on several evaluation criteria, demonstrate the proposed method's effectiveness, superiority, and robustness compared to state-of-the-art optimization approaches. Future work can include various datasets to assess the generalization of the proposed algorithms to other fields, such as constrained engineering, classification, and feature selection challenges. Other approaches, such as sparse auto-encoding, can also be compared with the proposed model in future work.

Appendix
ADGWDTO is assessed for benchmark functions (F1 through F7) [32] in this appendix, as indicated in Table 11. Fig 7 compares the algorithm's convergence curves to those of competing algorithms for benchmark functions. The rapid convergence of the suggested algorithm, as seen in this figure, demonstrates how the suggested approach improves the capability of exploration. Fig 8 is a box plot comparing the proposed method to competing algorithms for the seven benchmark functions. Table 12 presents the mean and standard deviation of the proposed and compared algorithms for the benchmark functions (F1 to F7). The outcomes of the ANOVA test for the benchmark functions are shown in Table 13. The T-test for the benchmark functions (F1 through F7) using the proposed algorithm versus the compared techniques is presented in Table 14. The results illustrate the efficacy of the proposed methodology.